Bayesian multitask inverse reinforcement learning



Christos Dimitrakakis 1 and Constantin A. Rothkopf 2 

1 EPFL, Lausanne, Switzerland 
christos.dimitrakakis@epfl.ch
2 Frankfurt Institute for Advanced Studies, Frankfurt, Germany 
rothkopf@fias.uni-frankfurt.de



Abstract. We generalise the problem of inverse reinforcement learning
to multiple tasks, from multiple demonstrations. Each demonstration may
represent one expert trying to solve a different task, or different experts
trying to solve the same task. Our main contribution is to formalise the
problem as statistical preference elicitation, via a number of structured
priors, whose form captures our biases about the relatedness of different
tasks or expert policies. In doing so, we introduce a prior on policy
optimality, which is more natural to specify. We show that our framework
allows us not only to learn efficiently from multiple experts but also
to effectively differentiate between the goals of each. Possible applications
include analysing the intrinsic motivations of subjects in behavioural
experiments and learning from multiple teachers.

Key words: Bayesian inference, intrinsic motivations, inverse reinforce- 
ment learning, multitask learning, preference elicitation 

1 Introduction 

This paper deals with the problem of multitask inverse reinforcement learning. 
Loosely speaking, this involves inferring the motivations and goals of an unknown 
agent performing a series of tasks in a dynamic environment. It is also equivalent 
to inferring the motivations of different experts, each attempting to solve the 
same task, but whose different preferences and biases affect the solution they 
choose. Solutions to this problem can also provide principled statistical tools for 
the interpretation of behavioural experiments with humans and animals. 

While both inverse reinforcement learning and multitask learning are
well-known problems, to our knowledge this is the only principled statistical
formulation of this problem. Our first major contribution generalises our previous
work [20], a statistical approach for single-task inverse reinforcement learning,
to a hierarchical (population) model discussed in Section 3. Our second major
contribution is an alternative model, which uses a much more natural prior on
the optimality of the demonstrations, in Section 4, for which we also provide
computational complexity bounds. An experimental analysis of the procedures
is given in Section 5, while the connections to related work are discussed in
Section 6. Auxiliary results and proofs are given in the appendix.

2 The general model

We assume that all tasks are performed in an environment with dynamics drawn
from the same distribution (which may be singular). We define the environment
as a controlled Markov process (CMP) $\nu = (\mathcal{S}, \mathcal{A}, \mathcal{T})$, with state space $\mathcal{S}$, action
space $\mathcal{A}$, and transition kernel $\mathcal{T} = \{ \tau(\cdot \mid s, a) : s \in \mathcal{S}, a \in \mathcal{A} \}$, indexed in $\mathcal{S} \times \mathcal{A}$
such that $\tau(\cdot \mid s, a)$ is a probability measure³ on $\mathcal{S}$. The dynamics of the
environment are Markovian: if at time $t$ the environment is in state $s_t \in \mathcal{S}$ and the agent
performs action $a_t \in \mathcal{A}$, then the next state $s_{t+1}$ is drawn with a probability
independent of previous states and actions: $\mathbb{P}_\nu(s_{t+1} \in S \mid s^t, a^t) = \tau(S \mid s_t, a_t)$,
$S \subset \mathcal{S}$, where we use the convention $s^t = s_1, \ldots, s_t$ and $a^t = a_1, \ldots, a_t$ to
represent sequences of variables, with $\mathcal{S}^t, \mathcal{A}^t$ being the corresponding product
spaces. If the dynamics of the environment are unknown, we can maintain a
belief about what the true CMP is, expressed as a probability measure $\omega$ on the
space of controlled Markov processes $\mathcal{N}$.

During the $m$-th demonstration, we observe an agent acting in the
environment and obtain a $T_m$-long sequence of actions and a sequence of states:
$d_m = (a_m^{T_m}, s_m^{T_m})$, $a_m^{T_m} = a_{m,1}, \ldots, a_{m,T_m}$, $s_m^{T_m} = s_{m,1}, \ldots, s_{m,T_m}$. The $m$-th task
is defined via an unknown utility function, $U_{m,t}$, according to which the
demonstrator selects actions, which we wish to discover. Setting $U_{m,t}$ equal to the total
discounted return⁴, we establish a link with inverse reinforcement learning:

Assumption 1 The agent's utility at time $t$ is defined in terms of future
rewards: $U_{m,t} \triangleq \sum_{k=t}^{T} \gamma^k r_k$, where $\gamma \in [0, 1]$ is a discount factor, and the reward
$r_t$ is given by the reward function $\rho_m : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ so that $r_t = \rho_m(s_t, a_t)$.
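
As a small illustration of Assumption 1, the sketch below evaluates the discounted utility of a finite reward sequence. The function name `discounted_utility` and the zero-based time index are assumptions made for this example only; the discount is applied as $\gamma^k$, following the assumption as stated.

```python
import numpy as np

def discounted_utility(rewards, gamma, t=0):
    """Return sum_{k=t}^{T} gamma**k * r_k for a finite reward sequence.

    Time is zero-based here (an assumption of this sketch), so rewards[k] is r_k.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = np.arange(t, len(rewards))
    return float(np.sum(gamma ** k * rewards[t:]))

# Rewards 1, 0, 2 with gamma = 0.5: 1*1 + 0.5*0 + 0.25*2
u0 = discounted_utility([1.0, 0.0, 2.0], gamma=0.5)
```

With $\gamma = 1$ the same function returns the undiscounted sum of the remaining rewards.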

In the following, for simplicity we drop the subscript $m$ whenever it is clear
by context. For any reward function $\rho$, the controlled Markov process and the
resulting utility $U$ define a Markov decision process [17] (MDP), denoted by
$\mu = (\nu, \rho, \gamma)$. The agent uses some policy $\pi$ to select actions $a_t \sim \pi(\cdot \mid s^t, a^{t-1})$,
which together with the Markov decision process $\mu$, defines a distribution⁵ on the
sequences of states, such that $\mathbb{P}_{\mu,\pi}(s_{t+1} \in S \mid s^t, a^{t-1}) = \int_{\mathcal{A}} \tau(S \mid a, s_t) \, d\pi(a \mid s^t, a^{t-1})$,
where we use a subscript to denote that the probability is taken
with respect to the process defined jointly by $\mu, \pi$. We shall use this
notational convention throughout this paper. Similarly, the expected utility of a
policy $\pi$ is denoted by $\mathbb{E}_{\mu,\pi} U_t$. We also introduce the family of Q-value functions
$\{ Q^\pi_\mu : \mu \in \mathcal{M}, \pi \in \mathcal{P} \}$, where $\mathcal{M}$ is a set of MDPs, with $Q^\pi_\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ such
that: $Q^\pi_\mu(s, a) \triangleq \mathbb{E}_{\mu,\pi}(U_t \mid s_t = s, a_t = a)$. Finally, we use $Q^*_\mu$ to denote the
optimal Q-value function for an MDP $\mu$, such that: $Q^*_\mu(s, a) = \sup_{\pi \in \mathcal{P}} Q^\pi_\mu(s, a)$,
$\forall s \in \mathcal{S}, a \in \mathcal{A}$. With a slight abuse of notation, we shall use $Q_\rho$ when we only

3 We assume the measurability of all sets with respect to some appropriate $\sigma$-algebra.

4 Other forms of the utility are possible. For example, consider an agent who collects 
gold coins in a maze with traps, and where the agent's utility is the logarithm of the 
number of coins it has after it has exited the maze. 

5 When the policy is reactive, then $\pi(a_t \mid s^t, a^{t-1}) = \pi(a_t \mid s_t)$, and the process
reduces to first order Markov.

Fig. 1. Graphical model of general multitask reward-policy priors. Lighter colour
indicates latent variables. Here $\eta$ is the hyperprior on the joint reward-policy
prior $\phi$ while $\rho_m$ and $\pi_m$ are the reward and policy of the $m$-th task, for which we
observe the demonstration $d_m$. The undirected link between $\pi$ and $\rho$ represents
the fact that the rewards and policy are jointly drawn from the reward-policy
prior. The implicit dependencies on $\nu$ are omitted for clarity.



need to distinguish between different reward functions $\rho$, as long as the remaining
components of $\mu$ are clear from the context.

Loosely speaking, our problem is to estimate the sequence of reward
functions $\boldsymbol{\rho} = \rho_1, \ldots, \rho_m, \ldots, \rho_M$, and policies $\boldsymbol{\pi} = \pi_1, \ldots, \pi_m, \ldots, \pi_M$, which were
used in the demonstrations, given the data $D = d_1, \ldots, d_m, \ldots, d_M$ from all
demonstrations and some prior beliefs. In order to do this, we define a multitask
reward-policy prior distribution as a Bayesian hierarchical model.



2.1 Multitask priors on reward functions and policies 

We consider two types of priors on rewards and policies. Their main difference 
is how the dependency between the reward and the policy is modelled. Due to 
the multitask setting, we posit that the reward function is drawn from some 
unknown distribution for each task, for which we assert a hyperprior, which 
is later conditioned on the demonstrations. The hyperprior $\eta$ is a probability
measure on the set of joint reward-policy priors $\Phi$. It is easy to see that, given
some specific $\phi \in \Phi$, we can use Bayes' theorem directly to obtain, for any
$A \subset \mathcal{P}^M$, $B \subset \mathcal{R}^M$, where $\mathcal{P}^M, \mathcal{R}^M$ are the policy and reward product spaces:

$$\phi(A, B \mid D) = \frac{\int_{A \times B} \phi(D \mid \boldsymbol{\rho}, \boldsymbol{\pi}) \, d\phi(\boldsymbol{\rho}, \boldsymbol{\pi})}{\int_{\mathcal{R}^M \times \mathcal{P}^M} \phi(D \mid \boldsymbol{\rho}, \boldsymbol{\pi}) \, d\phi(\boldsymbol{\rho}, \boldsymbol{\pi})} = \int_{A \times B} \prod_{m} d\phi(\rho_m, \pi_m \mid d_m).$$

When $\phi$ is not specified, we must somehow estimate some distribution on it.
In the empirical Bayes case [19] the idea is to simply find a distribution $\eta$ in
a restricted class $H$, according to some criterion, such as maximum likelihood.
In the hierarchical Bayes approach, followed herein, we select some prior $\eta$ and
then estimate the posterior distribution $\eta(\cdot \mid D)$.



We consider two models. In the first, discussed in Section 3,
we initially specify a product prior on reward functions and on policy parameters.
Jointly, these determine a unique policy, for which the probability of the observed
demonstration is well-defined. The policy-reward dependency is exchanged in the
alternative model, which is discussed in Section 4. There we specify a
product prior on policies and on policy optimality. This leads to a distribution
on reward functions, conditional on policies.



3 Multitask Reward-Policy prior (MRP) 

Let $\mathcal{R}$ be the space of reward functions $\rho$ and $\mathcal{P}$ the space of policies $\pi$. Let
$\psi(\cdot \mid \nu) \in \Psi$ denote a conditional probability measure on the reward functions
$\mathcal{R}$ such that for any $B \subset \mathcal{R}$, $\psi(B \mid \nu)$ corresponds to our prior belief that the
reward function is in $B$, when the CMP is known to be $\nu$. For any reward function
$\rho \in \mathcal{R}$, we define a conditional probability measure $\xi(\cdot \mid \rho, \nu) \in \Xi$ on the space
of policies $\mathcal{P}$. Let $\rho_m, \pi_m$ denote the $m$-th demonstration's reward function and
policy respectively. We use a product⁶ hyperprior⁷ $\eta$ on the set of reward function
distributions and policy distributions $\Psi \times \Xi$, such that $\eta(\Psi', \Xi') = \eta(\Psi')\eta(\Xi')$ for
all $\Psi' \subset \Psi$, $\Xi' \subset \Xi$. Our model is specified as follows:

$$(\psi, \xi) \sim \eta(\cdot \mid \nu), \qquad \rho_m \mid \psi, \nu \sim \psi(\cdot \mid \nu), \qquad \pi_m \mid \xi, \nu, \rho_m \sim \xi(\cdot \mid \rho_m, \nu).$$ (3.1)

In this case, the joint prior on reward functions and policies can be written as
$\phi(P, R \mid \nu) \triangleq \int_R \xi(P \mid \rho, \nu) \, d\psi(\rho \mid \nu)$ with $P \subset \mathcal{P}$, $R \subset \mathcal{R}$, such that $\phi(\cdot \mid \nu)$ is a
probability measure on $\mathcal{P} \times \mathcal{R}$ for any CMP $\nu$.⁸ In our model, the only observable
variables are $\eta$, which we select ourselves, and the demonstrations $D$.



3.1 The policy prior 

The model presented in this section involves restricting the policy space to a
parametric form. As a simple example, we consider stationary soft-max policies
with an inverse temperature parameter $c$:

$$\pi(a_t \mid s_t, \mu, c) = \mathrm{Softmax}(a_t \mid s_t, \mu, c) \triangleq \frac{\exp(c \, Q^*_\mu(s_t, a_t))}{\sum_{a} \exp(c \, Q^*_\mu(s_t, a))},$$ (3.2)

where we assumed a finite action set for simplicity. Then we can define a prior
on policies, given a reward function, by specifying a prior $\beta$ on $c$. Inference
can be performed using standard Monte Carlo methods. If we can estimate the
reward functions well enough, we may be able to obtain policies that surpass the
performance of the demonstrators.
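
The soft-max policy of (3.2) can be sketched for a small finite MDP as follows. This is only an illustration under stated assumptions: the optimal Q-values are obtained by plain value iteration, the two-state, two-action CMP is hypothetical, and the names `optimal_q` and `softmax_policy` are introduced here, not taken from the paper.

```python
import numpy as np

def optimal_q(T, rho, gamma, n_iter=1000, tol=1e-10):
    """Value iteration for Q*. T[s, a, s2] are transition probabilities, rho[s, a] rewards."""
    Q = np.zeros_like(rho, dtype=float)
    for _ in range(n_iter):
        Q_new = rho + gamma * (T @ Q.max(axis=1))   # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q_new

def softmax_policy(Q, c):
    """Return pi[s, a] proportional to exp(c * Q[s, a])."""
    z = c * (Q - Q.max(axis=1, keepdims=True))      # shift each row for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical 2-state, 2-action CMP: action 0 stays, action 1 switches state.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
rho = np.array([[0.0, 0.0], [1.0, 1.0]])            # reward 1 for acting in state 1
Q_star = optimal_q(T, rho, gamma=0.9)
pi = softmax_policy(Q_star, c=5.0)
```

With $c = 0$ the policy is uniform, and as $c$ grows it concentrates on the actions that maximise $Q^*_\mu$, which is why a prior on $c$ acts as a prior on how close to optimal the demonstrator is.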



6 Even if a prior distribution is a product, the posterior may not necessarily remain a
product. Consequently, this choice does not imply the assumption that rewards are
independent from policies.

7 In order to simplify the exposition somewhat, while maintaining generality, we
usually specify distributions on functions or other distributions directly, rather than on
their parameters.

8 If the CMP itself is unknown, so that we only have a probabilistic belief $\omega$ on $\mathcal{N}$, we
can instead consider the marginal $\phi(P, R \mid \omega) \triangleq \int_{\mathcal{N}} \phi(P, R \mid \nu) \, d\omega(\nu)$.

3.2 Reward priors

In our previous work [20], we considered a product-Beta distribution on states
(or state-action pairs) for the reward function prior. Herein, however, we develop
a more structured prior, by considering reward functions as a measure on the
state space $\mathcal{S}$ with $\rho(\mathcal{S}) = 1$. Then for any state subsets $S_1, S_2 \subset \mathcal{S}$ such that
$S_1 \cap S_2 = \emptyset$, $\rho(S_1 \cup S_2) = \rho(S_1) + \rho(S_2)$. A well-known distribution on probability
measures is a Dirichlet process. Consequently, when $\mathcal{S}$ is finite, we can use a
Dirichlet prior for rewards, such that each sampled reward function is equivalent
to multinomial parameters. This is more constrained than the Beta-product
prior and has the advantage of clearly separating the reward function from the
$c$ parameter in the policy model. It also brings the Bayesian approach closer to
approaches which bound the $L_1$ norm of the reward function such as [21].

3.3 Estimation 

The simplest possible algorithm consists of sampling directly from the prior. In
our model, the prior on the reward function $\rho$ and inverse temperature $c$ is a
product, and so we can simply take independent samples from each, obtaining
an approximate posterior on rewards and policies, as shown in Alg. 1. While
such methods are known to converge asymptotically to the true expectation
under mild conditions [12], stronger technical assumptions are required for finite
sample bounds, due to importance sampling in step 8.



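Algorithm 1 MRP-MC: Multitask Reward-Policy Monte Carlo. Given the data
$D$, we obtain $\hat{\eta}$, the approximate posterior on the reward-policy distribution, and
$\hat{\rho}_m$, the $\hat{\eta}$-expected reward function for the $m$-th task.
1: for $k = 1, \ldots, K$ do
2:   $\phi^{(k)} = (\xi^{(k)}, \psi^{(k)}) \sim \eta$, $\xi^{(k)} = \mathrm{Gamma}(g_1^{(k)}, g_2^{(k)})$.
3:   for $m = 1, \ldots, M$ do
4:     $\rho_m^{(k)} \sim \psi^{(k)}(\rho \mid \nu)$, $c_m^{(k)} \sim \mathrm{Gamma}(g_1^{(k)}, g_2^{(k)})$
5:     $\mu_m^{(k)} = (\nu, \rho_m^{(k)}, \gamma)$, $\pi_m^{(k)} = \mathrm{Softmax}(\cdot \mid \cdot, \mu_m^{(k)}, c_m^{(k)})$, $p_m^{(k)} = \pi_m^{(k)}(a_m^{T_m} \mid s_m^{T_m})$
6:   end for
7: end for
8: $q^{(k)} = \prod_m p_m^{(k)} \big/ \sum_{j=1}^{K} \prod_m p_m^{(j)}$
9: $\hat{\eta}(B \mid D) = \sum_{k=1}^{K} \mathbb{I}\{\phi^{(k)} \in B\} \, q^{(k)}$, for $B \subset \mathcal{R} \times \mathcal{P}$.
10: $\hat{\rho}_m = \sum_{k=1}^{K} \rho_m^{(k)} q^{(k)}$, $m = 1, \ldots, M$.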
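
A rough sketch of the importance-sampling idea behind Algorithm 1 is given below for a small finite MDP with state rewards. Several choices are assumptions made for illustration and are not taken from the paper: the hyperprior is replaced by simple Gamma draws for the Dirichlet concentration and the Gamma parameters, value iteration runs for a fixed number of sweeps, and the weights are computed in log space.

```python
import numpy as np

def optimal_q(T, rho, gamma, n_iter=300):
    """Value iteration for state rewards rho[s]; T[s, a, s2] are transition probabilities."""
    Q = np.zeros(T.shape[:2])
    for _ in range(n_iter):
        Q = rho[:, None] + gamma * (T @ Q.max(axis=1))
    return Q

def log_softmax_likelihood(Q, c, states, actions):
    """Log-probability of the observed actions under a soft-max policy with inverse temperature c."""
    z = c * Q[states]
    z = z - z.max(axis=1, keepdims=True)
    log_pi = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_pi[np.arange(len(actions)), actions].sum()

def mrp_mc(demos, T, gamma, K=200, seed=0):
    """demos: list of (states, actions) integer arrays, one pair per task."""
    rng = np.random.default_rng(seed)
    n_states, M = T.shape[0], len(demos)
    rhos = np.empty((K, M, n_states))
    log_w = np.zeros(K)
    for k in range(K):
        # Hypothetical hyperprior draw (steps 1-2): Dirichlet and Gamma parameters.
        alpha = rng.gamma(1.0, 1.0, size=n_states) + 0.1
        g1, g2 = rng.gamma(2.0, 1.0, size=2) + 1e-3
        for m, (s, a) in enumerate(demos):
            rhos[k, m] = rng.dirichlet(alpha)            # step 4: task reward
            c = rng.gamma(g1, g2)                        # step 4: inverse temperature
            Q = optimal_q(T, rhos[k, m], gamma)
            log_w[k] += log_softmax_likelihood(Q, c, s, a)   # step 5, in log space
    q = np.exp(log_w - log_w.max())
    q /= q.sum()                                         # step 8: normalised weights
    rho_hat = np.einsum("k,kms->ms", q, rhos)            # step 10: expected rewards
    return q, rho_hat

# Hypothetical CMP: action 0 stays, action 1 switches between two states.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
demos = [(np.array([0, 1, 1, 1, 1, 1]), np.array([1, 0, 0, 0, 0, 0]))]
q, rho_hat = mrp_mc(demos, T, gamma=0.9)
```

Because every sample is drawn from the prior, most weights can be very small when the demonstrations are long, which is consistent with the remark that finite-sample guarantees need stronger assumptions.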



An alternative, which may be more efficient in practice if a good proposal
distribution can be found, is to employ a Metropolis-Hastings sampler instead,
which we shall refer to as MRP-MH. Other samplers are possible, including a
hybrid Gibbs sampler such as the one introduced in [20], hereafter referred to
as MRP-Gibbs.

4 Multitask Policy Optimality prior (MPO)

Specifying a parametric form for the policy, such as the softmax, is rather awk- 
ward and hard to justify. It is more natural to specify a prior on the optimality 
of the policy demonstrated. Given the data, and a prior over a policy class (e.g. 
stationary policies), we obtain a posterior distribution on policies. Then, via a 
simple algorithm, we can combine this with the optimality prior and obtain a 
posterior distribution on reward functions. 

As before, let $D$ be the observed data and let $\xi$ be a prior probability measure
on the set of policies $\mathcal{P}$, encoding our biases towards specific policy types. In
addition, let $\{ \psi(\cdot \mid \pi) : \pi \in \mathcal{P} \}$ be a set of probability measures on $\mathcal{R}$, indexed
in $\mathcal{P}$, to be made precise later. In principle, we can now calculate the marginal
posterior over reward functions $\rho$ given the observations $D$, as follows:

$$\psi(B \mid D) = \int_{\mathcal{P}} \psi(B \mid \pi) \, d\xi(\pi \mid D), \qquad B \subset \mathcal{R}.$$ (4.1)
Jv 

The main idea is to define a distribution over reward functions, via a prior
on the optimality of the policy followed. The first step is to explicitly define
the measures $\psi$ on $\mathcal{R}$ in terms of $\epsilon$-optimality, by defining a prior measure $\beta$ on
$\mathbb{R}_+$, such that $\beta([0, \epsilon])$ is our prior belief that the policy is $\epsilon$-optimal. Assuming that
$\beta(\epsilon) = \beta(\epsilon \mid \pi)$ for all $\pi$, we obtain:

$$\psi(B \mid \pi) = \int_0^\infty \psi(B \mid \epsilon, \pi) \, d\beta(\epsilon),$$ (4.2)

where $\psi(B \mid \epsilon, \pi)$ can be understood as the prior probability that $\rho \in B$ given
that the policy $\pi$ is $\epsilon$-optimal. The marginal (4.1) can now be written as:

$$\psi(B \mid D) = \int_{\mathcal{P}} \left[ \int_0^\infty \psi(B \mid \epsilon, \pi) \, d\beta(\epsilon) \right] d\xi(\pi \mid D).$$ (4.3)



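We now construct $\psi(\cdot \mid \epsilon, \pi)$. Let the set of $\epsilon$-optimal reward functions with
respect to $\pi$ be: $\mathcal{R}^\pi_\epsilon \triangleq \{ \rho \in \mathcal{R} : \|V^*_\rho - V^\pi_\rho\|_\infty < \epsilon \}$. Let $\lambda(\cdot)$ be an arbitrary
measure on $\mathcal{R}$ (e.g. the counting measure if $\mathcal{R}$ is discrete). We can now set:

$$\psi(B \mid \epsilon, \pi) \triangleq \frac{\lambda(B \cap \mathcal{R}^\pi_\epsilon)}{\lambda(\mathcal{R}^\pi_\epsilon)}, \qquad B \subset \mathcal{R}.$$ (4.4)

Then $\lambda(\cdot)$ can be interpreted as an (unnormalised) prior measure on reward
functions. If the set of reward functions $\mathcal{R}$ is finite, then a simple algorithm can
be used to estimate preferences, described below.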

We are given a set of demonstration trajectories $D$ and a prior on policies
$\xi$, from which we calculate a posterior on policies $\xi(\cdot \mid D)$. We sample a set of
$K$ policies $\Pi = \{\pi^{(i)} : i = 1, \ldots, K\}$ from this posterior. We are also given a
set of reward functions $\mathcal{R}$ with associated measure $\lambda(\cdot)$. For each policy-reward
pair $(\pi^{(i)}, \rho_j) \in \Pi \times \mathcal{R}$, we calculate the loss of the policy for the given reward
function to obtain a loss matrix:

$$L \triangleq [\ell_{i,j}]_{K \times |\mathcal{R}|}, \qquad \ell_{i,j} \triangleq \sup_s V^*_{\rho_j}(s) - V^{\pi^{(i)}}_{\rho_j}(s),$$ (4.5)

where $V^*_{\rho_j}$ and $V^{\pi^{(i)}}_{\rho_j}$ are the value functions, for the reward function $\rho_j$, of the
optimal policy and $\pi^{(i)}$ respectively.⁹
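
The loss matrix of (4.5) can be computed for a known finite CMP as sketched below. The helper names, the use of value iteration for $V^*$ and of a linear solve for $V^\pi$, and the two-state example are assumptions of this sketch rather than details given in the paper.

```python
import numpy as np

def optimal_value(T, rho, gamma, n_iter=2000):
    """V* by value iteration; T[s, a, s2] transitions, rho[s, a] rewards."""
    V = np.zeros(T.shape[0])
    for _ in range(n_iter):
        V = (rho + gamma * (T @ V)).max(axis=1)
    return V

def policy_value(T, rho, gamma, pi):
    """Solve V = r_pi + gamma * P_pi V exactly for a stationary policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, T)
    r_pi = (pi * rho).sum(axis=1)
    return np.linalg.solve(np.eye(T.shape[0]) - gamma * P_pi, r_pi)

def loss_matrix(policies, rewards, T, gamma):
    """L[i, j] = max_s V*_{rho_j}(s) - V^{pi_i}_{rho_j}(s)."""
    L = np.empty((len(policies), len(rewards)))
    for j, rho in enumerate(rewards):
        v_star = optimal_value(T, rho, gamma)
        for i, pi in enumerate(policies):
            L[i, j] = np.max(v_star - policy_value(T, rho, gamma, pi))
    return L

# Hypothetical CMP: action 0 stays, action 1 switches between two states.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
rho_a = np.array([[0.0, 0.0], [1.0, 1.0]])   # rewards state 1
rho_b = np.array([[1.0, 1.0], [0.0, 0.0]])   # rewards state 0
go_to_1 = np.array([[0.0, 1.0], [1.0, 0.0]])  # switch in state 0, stay in state 1
go_to_0 = np.array([[1.0, 0.0], [0.0, 1.0]])  # stay in state 0, switch in state 1
L = loss_matrix([go_to_1, go_to_0], [rho_a, rho_b], T, gamma=0.9)
```

Each policy has (numerically) zero loss for the reward function it is optimal for and a large loss for the other one, so the matrix directly encodes which rewards are compatible with which sampled policies.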

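Given samples $\pi^{(i)}$ from $\xi(\pi \mid D)$, we can estimate the integral (4.3)
accurately via $\hat{\psi}(B \mid D) \triangleq \frac{1}{K} \sum_{i=1}^{K} \int_0^\infty \psi(B \mid \epsilon, \pi^{(i)}) \, d\beta(\epsilon)$. In addition, note that
the loss matrix $L$ is finite, with a number of distinct elements at most $K \times |\mathcal{R}|$.
Consequently, $\psi(B \mid \epsilon, \pi^{(i)})$ is a piece-wise constant function with respect to $\epsilon$.
Let $(\epsilon_k)_{k=1}^{K \times |\mathcal{R}|}$ be a monotonically increasing sequence of the elements of $L$. Then
$\psi(B \mid \epsilon, \pi^{(i)}) = \psi(B \mid \epsilon', \pi^{(i)})$ for any $\epsilon, \epsilon' \in [\epsilon_k, \epsilon_{k+1}]$, and:

$$\hat{\psi}(B \mid D) \triangleq \frac{1}{K} \sum_{i=1}^{K} \sum_{k=1}^{K \times |\mathcal{R}|} \psi(B \mid \epsilon_k, \pi^{(i)}) \, \beta([\epsilon_k, \epsilon_{k+1}]).$$ (4.6)

Note that for an exponential prior with parameter $c$, we have $\beta([\epsilon_k, \epsilon_{k+1}]) =
e^{-c\epsilon_k} - e^{-c\epsilon_{k+1}}$. We can now find the optimal policy with respect to the expected
utility.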
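
The piece-wise constant sum of (4.6) can be sketched as follows for a finite reward set and an exponential optimality prior. The boundary conventions are assumptions of this sketch: on each interval starting at $\epsilon_k$ the compatible set is taken to be the rewards with loss at most $\epsilon_k$, intervals where that set is empty are skipped, the last interval extends to infinity, and the result is renormalised. The function name `mpo_posterior` is hypothetical.

```python
import numpy as np

def mpo_posterior(L, c, lam=None):
    """Posterior weights over N candidate rewards from a K x N loss matrix L.

    c > 0 is the rate of the exponential prior on the optimality gap;
    lam is the (unnormalised) prior measure on rewards, uniform by default.
    """
    K, N = L.shape
    lam = np.ones(N) if lam is None else np.asarray(lam, dtype=float)
    eps = np.unique(L)                                   # sorted distinct thresholds
    upper = np.append(eps[1:], np.inf)
    beta = np.exp(-c * eps) - np.exp(-c * upper)         # prior mass of each interval
    post = np.zeros(N)
    for i in range(K):
        for e, b in zip(eps, beta):
            inside = lam * (L[i] <= e)                   # rewards for which pi_i has loss <= e
            if inside.sum() > 0:
                post += b * inside / inside.sum()
    return post / post.sum()

# Two sampled policies, both optimal for reward 0 and far from optimal for reward 1.
L_example = np.array([[0.0, 9.0], [0.0, 9.0]])
post = mpo_posterior(L_example, c=1.0)
```

In this example almost all posterior mass falls on the first reward function; the small remainder comes from the prior mass assigned to optimality gaps of at least 9.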

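Theorem 1. Let $\hat{\psi}(\cdot \mid D)$ be the empirical posterior measure calculated via the
above procedure and assume $\rho$ takes values in $[0, 1]$ for all $\rho \in \mathcal{R}$. Then, for any
value function $V_\rho$,

$$\mathbb{E}\left( \|V_\rho - \hat{V}_\rho\|_\infty \mid D \right) \le \frac{1}{(1 - \gamma)\sqrt{K}} \left( 2 + \frac{1}{2} \sqrt{\ln K} \right),$$ (4.7)

where the expectation is taken w.r.t. the marginal distribution on $\mathcal{R}$.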

This theorem, whose proof is in the appendix, bounds the number of samples
required to obtain a small loss in the value function estimation, and holds with only
minor modifications for both the single and multi-task cases for finite $\mathcal{R}$. For the
multi-task case and general $\mathcal{R}$, we can use MPO-MC (Alg. 2)
to sample $N$ reward functions from a prior. Unfortunately the theorem does not
apply directly for infinite $\mathcal{R}$. While one could define an $\epsilon$-net on $\mathcal{R}$, and assume
smoothness conditions, in order to obtain optimality guarantees for that case,
this is beyond the scope of this paper.

5 Experiments 

Given a distribution on the reward functions $\psi$, and known transition
distributions, one can obtain a stationary policy that is optimal with respect to this
distribution via value iteration. This is what single-task algorithms essentially
do, but it ignores differences among tasks. In the multi-task setting, we infer the
optimal policy $\hat{\pi}^*_m$ for the $m$-th task. Its $L_1$-loss with respect to the optimal value
function is $\ell_m(\hat{\pi}^*_m) \triangleq \sum_{s \in \mathcal{S}} V^*_{\rho_m}(s) - V^{\hat{\pi}^*_m}_{\rho_m}(s)$. We are interested in minimising
the total loss $\sum_m \ell_m$ across demonstrations. We first examined the efficiency

9 Again, we abuse notation slightly and employ $V_{\rho_j}$ to denote the value function of the
MDP $(\nu, \rho_j)$, for the case when the underlying CMP $\nu$ is known. For the case when
we only have a belief $\omega$ on the set of CMPs $\mathcal{N}$, $V_{\rho_j}$ refers to the expected utility with
respect to $\omega$, or more precisely $V^\pi_{\rho_j}(s) = \mathbb{E}_\omega(U_t \mid s_t = s, \rho_j, \pi) = \int_{\mathcal{N}} V^\pi_{\nu,\rho_j}(s) \, d\omega(\nu)$.

Algorithm 2 MPO-MC: Multitask Policy Optimality Monte Carlo posterior
estimate
1: Sample $N$ reward functions $\rho_1, \ldots, \rho_N \sim \psi$.
2: for $k = 1, \ldots, K$ do
3:   $(\xi^{(k)}, \psi^{(k)}) \sim \eta$, where $\psi^{(k)}$ is multinomial over $N$ outcomes.
4:   for $m = 1, \ldots, M$ do
5:     $\pi_m^{(k)} \sim \xi^{(k)}(\cdot \mid d_m)$.
6:   end for
7: end for
8: Calculate $\hat{\psi}_m(\cdot \mid d_m)$ from (4.6) and $\{\pi_m^{(k)} : k = 1, \ldots, K\}$.
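
Step 5 of Algorithm 2 requires samples from a posterior over policies given one demonstration. One simple way to realise this, shown below purely as an assumption of this sketch, is an independent Dirichlet prior on the action probabilities of each state of a stationary policy, so that the posterior is again Dirichlet with the observed action counts added. The function name `sample_policies` and the `prior_count` parameter are hypothetical.

```python
import numpy as np

def sample_policies(states, actions, n_states, n_actions, K, prior_count=1.0, seed=0):
    """Draw K stationary policies pi[s, a] from a Dirichlet posterior given one demonstration."""
    rng = np.random.default_rng(seed)
    counts = np.full((n_states, n_actions), prior_count)
    np.add.at(counts, (states, actions), 1.0)            # add observed state-action counts
    # One independent Dirichlet draw per state, repeated K times.
    return np.array([[rng.dirichlet(counts[s]) for s in range(n_states)]
                     for _ in range(K)])

# Hypothetical demonstration: action 1 is always chosen in state 0, action 0 in state 1.
demo_states = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
demo_actions = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
policies = sample_policies(demo_states, demo_actions, n_states=2, n_actions=2, K=500)
```

The sampled policies can then be scored against each candidate reward function to build the loss matrix used in (4.6).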




Fig. 2. Expected loss for two samplers, averaged over $10^3$ runs, as the number of
total samples increases. Fig. 2(b) compares the MRP and MPO models using
a Monte Carlo estimate. Fig. 2(a) shows the performance of different sampling
strategies for the MTPP model: Metropolis-Hastings sampling, with different
numbers of parallel chains, and simple Monte Carlo estimation.



of sampling. Initially, we used the Chain task [8] with 5 states (c.f. Fig. 3(a)),
$\gamma = 0.95$ and a demonstrator using standard model-based reinforcement
learning with an $\epsilon$-greedy exploration policy with $\epsilon = 10^{-2}$, using the Dirichlet prior
on reward functions. As Fig. 2(a) shows, for the MRP model, results slightly
favour the single chain MH sampler. Figure 2(b) compares the performance of
the MRP and MPO models using an MC sampler. The actual computing time
of MPO is larger by a constant factor due to the need to calculate (4.6).

In further experiments, we compared the multi-task performance of MRP
with that of an imitator, for the generalised chain task where rewards are sampled
from a Dirichlet prior. We fixed the number of demonstrations to 10 and varied
the number of tasks. The gain of using a multi-task model is shown in Fig. 3(b).
Finally, we examined the effect of the demonstration's length, independently of
the number of tasks. Figs. 3(c) and 3(d) show that when there is more data, MPO
is much more efficient, since we sample directly from $\xi(\pi \mid D)$. In that case, the
MRP-MC sampler is very inefficient. For reference, we include the performance
of MWAL and the imitator.




Fig. 3. Experiments on the chain task. (a) The 3-state version of the task. (b)
Empirical performance difference of MRP-MC and Imit is shown for {1, 2, 5, 10}
tasks respectively, with 10 total demonstrations. As the number of tasks
increases, so does the performance gain of the multitask prior relative to an
imitator. (c,d) Single-task sample efficiency in the 5-state Chain task with $r_1 = 0.2$,
$r_2 = 0$, $r_3 = 1$. The data is sufficient for the imitator to perform rather well.
However, while the MPO-MC is consistently better than the imitator, MRP-MC
converges slowly.



The second experiment samples variants of Random MDP tasks [20] from
a hierarchical model, where Dirichlet parameters are drawn from a product of
Gamma(1, 10) and task rewards are sampled from the resulting Dirichlets. Each
demonstration is drawn from a softmax policy with respect to the current task,
with $c \in [2, 8]$, for a total of 50 steps. We compared the loss of policies derived
from MRP-MC with that of algorithms described in [16, 18, 21], as well as a
flat model [20]. Fig. 4(a) shows the loss for varying $c$, when
the (unknown) number of tasks equals 20. While flat MH can recover reward
functions that lead to policies that outperform the demonstrator, the multi-task
model MRP-MH shows a clear additional improvement. Figure 4(b) shows that
this increases with the number of available demonstrations, indicating that the
task distribution is estimated well. In contrast, RP-MH degrades slowly, due to
its assumption that all demonstrations share a common reward function.
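
The hierarchical generation of task rewards described above can be sketched as follows. The reading of Gamma(1, 10) as shape 1 and scale 10 is an assumption (the text does not say whether the second parameter is a scale or a rate), as are the function name and the treatment of rewards as a distribution over states.

```python
import numpy as np

def sample_task_rewards(n_states, n_tasks, shape=1.0, scale=10.0, seed=0):
    """Draw shared Dirichlet parameters, then one state-reward vector per task."""
    rng = np.random.default_rng(seed)
    # Product of Gamma distributions: one independent draw per state.
    alpha = rng.gamma(shape, scale, size=n_states)
    alpha = np.maximum(alpha, 1e-2)          # guard against degenerate parameters
    rewards = rng.dirichlet(alpha, size=n_tasks)
    return alpha, rewards

alpha, task_rewards = sample_task_rewards(n_states=8, n_tasks=20)
```

All tasks share the same Dirichlet parameters, so their reward functions are related but not identical, which is the structure the multitask prior is meant to exploit.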




Fig. 4. Experiments on random MDP tasks, comparing MTPP-MH with the
original (RP-MH) sampler [20], a demonstrator employing a softmax policy
(soft), Policy Walk (PolWalk) [18], Linear Programming (LinProg) [16] and
MWAL [21], averaged over $10^2$ runs. Fig. 4(a) shows the loss as the inverse
softmax temperature $c$ increases, for a fixed number of $M = 20$ tasks. Fig. 4(b)
shows the loss relative to the optimal policy as the number of tasks increases,
for fixed $c = 8$. There is one 50-step demonstration per task. The error bars
indicate standard error.



6 Related work and discussion 

A number of inverse reinforcement learning [H, S ll , IH HI 2il 13 and preference 
elicitation 0, @ approaches have been proposed, while multitask learning itself 
is a well-known problem, for which hierarchical Bayesian approaches are quite 
natural [13]. In fact, two Bayesian approaches have been considered for multitask
reinforcement learning. Wilson et al. [22] consider a prior on MDPs, while Lazaric
and Ghavamzadeh [14] employ a prior on value functions.

The first work that we are aware of that performs multi-task estimation of 
utilities is Q , which used a hierarchical Bayesian model to represent relationships 
between preferences. Independently to us, Q recently considered the problem 
of learning for multiple intentions (or reward functions). Given the number of 
intentions, they employ an expectation maximisation approach for clustering. 
Finally, a generalisation of IRL to the multi-agent setting was examined by
Natarajan et al. [15]. This is the problem of finding a good joint policy for a
number of agents acting simultaneously in the environment.

Our approach can be seen as a generalisation of Q to the dynamic setting
of inverse reinforcement learning; of to full Bayesian estimation; and of [20]
to multiple tasks. This enables significant potential applications. For example,
we have a first theoretically sound formalisation of the problem of learning from
multiple teachers who all try to solve the same problem, but who have different
preferences for doing so. In addition, the principled Bayesian approach allows
us to infer a complete distribution over task reward functions. Technically, the
work presented in this paper is a direct generalisation of our previous paper [20],
which proposed single-task equivalents of the policy parameter priors discussed
in Sec. 3, to the multitask setting. In addition to the introduction of multiple
tasks, we provide an alternative policy optimality prior, which is not only a
much more natural prior to specify, but one for which we can also obtain
computational complexity bounds.

In future work, we may consider non-parametric priors, such as those
considered in [10], for the policy optimality model of Sec. 4. Finally, when the MDP is
unknown, calculation of the optimal policy is in general much harder. However,
in a recent paper [9] we show how to obtain near-optimal memoryless policies
for the unknown MDP case, which would be applicable in this setting.

A Auxiliary results and proofs

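Lemma 1 (Hoeffding inequality). For independent random variables $X_1, \ldots, X_n$
such that $X_i \in [a_i, b_i]$, with $\mu_i \triangleq \mathbb{E} X_i$ and $t > 0$:

$$P\left( \sum_{i=1}^{n} X_i \ge \sum_{i=1}^{n} \mu_i + nt \right) \le \exp\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),$$

and the same bound holds for $P\left( \sum_{i=1}^{n} X_i \le \sum_{i=1}^{n} \mu_i - nt \right)$.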



Corollary 1. Let $g : X \times Y \to \mathbb{R}$ be a function with total variation $\|g\|_{TV} \le
\sqrt{2/c}$, and let $P$ be a probability measure on $Y$. Define $f : X \to \mathbb{R}$ to be $f(x) \triangleq
\int_Y g(x, y) \, dP(y)$. Given a sample $y^n \sim P^n$, let $f^n(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} g(x, y_i)$. Then,
for any $\delta > 0$, with probability at least $1 - \delta$, $\|f - f^n\|_\infty < \sqrt{\frac{\ln(2/\delta)}{cn}}$.

Proof. Choose some $x \in X$ and define the function $h_x : Y \to [0, 1]$, $h_x(y) =
g(x, y)$. Let $h_x^n$ be the empirical mean of $h_x$ with $y_1, \ldots, y_n \sim P$. Then note that
the expectation of $h_x$ with respect to $P$ is $\mathbb{E} h_x = \int h_x(y) \, dP(y) = \int g(x, y) \, dP(y) =
f(x)$. Then $P^n(\{y^n : |f(x) - f^n(x)| > t\}) < 2e^{-cnt^2}$, for any $x$, due to
Hoeffding's inequality. Substituting gives us the required result.

Proof (Proof of Theorem 1). Firstly, note that the value function has
total variation bounded by $1/(1 - \gamma)$. Then Corollary 1 applies
with $c = 2(1 - \gamma)^2$. Consequently, the expected loss can be bounded as follows:

$$\mathbb{E}\left( \|V_\rho - \hat{V}_\rho\|_\infty \mid D \right) \le \sqrt{\frac{\ln(2/\delta)}{2(1 - \gamma)^2 K}} + \frac{\delta}{1 - \gamma}.$$

Setting $\delta = 2/\sqrt{K}$ gives us the required result.